LABORATORY FOR COMPUTER SCIENCE
MASSACHUSETTS INSTITUTE OF TECHNOLOGY

MIT/LCS/TR-495

THE SPECTRAL NORM OF FINITE FUNCTIONS

Mihir Bellare

February 1991

545 TECHNOLOGY SQUARE, CAMBRIDGE, MASSACHUSETTS 02139

This document was written for my area exam. The assigned papers were [FJS], [KM] and [SB].

The Spectral Norm of Finite Functions

Mihir Bellare
Laboratory for Computer Science
Massachusetts Institute of Technology

February 1, 1991

Abstract

In many recent results in learning and computational complexity theory which rely on Fourier analysis, the spectral norm plays a key role. An understanding of this quantity would appear to be useful both in gauging and exploiting these results, and in understanding the underlying techniques. This paper surveys various aspects of the spectral norm of finite functions. We consider some of the motivating results as well as both upper and lower bounds on the spectral norm and their relationships to the computational complexity of the function. Some of the results included here are new. In particular, we introduce a general technique for upper bounding the spectral norm of a decision tree over an arbitrary basis. We also extend the learning algorithm of Kushilevitz and Mansour [KM] to mutually independent distributions.

Keywords: Fourier series, spectral norm, learning, decision trees.
Contents

1 Introduction 2
2 Basics 3
3 Two Motivating Results 4
  3.1 Efficient Learnability and the Spectral Norm 4
  3.2 Sparse Representations and the Spectral Norm 6
4 Upper Bounds: Functions of Small Norm 7
5 The Spectral Norm of Decision Trees 8
  5.1 A General Framework 8
  5.2 Some Applications: Decision Trees of Small Norm 10
6 Lower Bounds: Functions of Large Norm 11
  6.1 General Functions 11
  6.2 Boolean Functions 11
  6.3 An Exponential Lower Bound for Majority 12
7 Generalization: q-norms 14
  7.1 The q-Framework 14
  7.2 Efficient Learnability and the q-Norm 16
  7.3 The q-norm of Decision Trees 18
  7.4 L_q versus L 18
A Appendix: Proof of the Combinatorial Identity 21

1 Introduction

Most functions are hard to learn efficiently. Functions of small spectral norm, say Kushilevitz and Mansour [KM], are an exception: they can be learned in polynomial time. Moreover these functions can be approximated by sparse Fourier series [SB]. They are also computable by threshold circuits of very small depth [Br], [HMPST], [BS], [SB]. And so on: the list of their virtues is a long one. Small is indeed beautiful. And that is just one reason to be interested in spectral norms.

Fourier Series and the Spectral Norm. The Fourier series of a function $f : \{0,1\}^n \to \mathbf{R}$ is obtained by expressing it as a linear combination of certain appropriately chosen basis functions: $f = \sum_{z \in \{0,1\}^n} \hat{f}(z)\chi_z$, where the $\chi_z : \{0,1\}^n \to \{-1,+1\}$ are the basis functions and the $\hat{f}(z)$ are real numbers depending on $f$. The spectral norm of $f$, denoted $L(f)$, is the sum of the absolute values of the coefficients in this linear combination: $L(f) = \sum_{z \in \{0,1\}^n} |\hat{f}(z)|$.

Fourier series have of late proved to be a very useful tool in learning and computational complexity theory. And as the above indicates, the spectral norm often plays a pivotal role. It is not just that small is beautiful.
If one examines these results more closely, one realizes that the spectral norm is a kind of "yardstick," its value for a given function measuring the "complexity" of this function with respect to the task at hand. For example, what the [KM] learning result really says is that a function is learnable in time polynomial in its spectral norm. The sparse approximation result [SB] really says that one can approximate a function by a Fourier series having a number of terms proportional to the spectral norm of the function. And so on. The value of the spectral norm of a function emerges as a key to many of its computational properties.

In order to exploit results such as those listed above, it becomes important to know more about this value. In particular, which functions have small spectral norm? Which don't? How does this relate to their complexity in various models of computation? These are the kinds of questions on which I will focus.

This Report. I begin by describing a couple of the motivating results mentioned above. Next I look at upper bounds on the spectral norm, where what one is really interested in, of course, is to identify the functions of small norm. Here I present two very simple examples. I then look at the spectral norm of decision trees, an interesting case of how the spectral norm of a function can relate to its complexity in some model of computation. In §6 I look at lower bounds, presenting two examples of functions whose spectral norm is close to the highest possible value for any boolean function. Finally, I turn to a generalization of the usual spectral norm which I call the q-norm. Results about the q-norm being much scarcer in the literature (it is a recent invention of Furst, Jackson and Smith [FJS]), I will spend more time here on developing techniques and applications. I conclude by considering briefly the relative merits of the two norms.

What's New. Among the results and proofs included here, some are new.
Principal among these are:

• A technique for upper bounding the spectral norm of a decision tree over an arbitrary basis. I introduce a general framework which enables one to compute, in a simple manner, a weight which for any decision tree is an upper bound on its spectral norm. This extends the result of [KM], which applied only to decision trees over the parity basis. As a consequence, many new classes of decision trees become learnable in polynomial time.

• An extension of the [KM] learning algorithm to mutually independent distributions. I generalize the proof of [KM] to show that their algorithm works to learn functions in time polynomial in their q-norm when the error of a hypothesis is measured by its proximity to the target under the mutually independent distribution $q$.

• A new proof of an exponential (and almost optimal) lower bound on the spectral norm of the majority function. This is only new in the sense that I have not seen any other proof (however I know that the result is known: I have just been unable to track down the reference).

Apologia. It being the first time I have looked at Fourier analysis, I have not hesitated to state the obvious. In order to tie results together and set the stage, I have often had to prove small things which are not in the papers only, I'm sure, because they are considered too basic. I include my proofs anyway. To disambiguate, I have followed the convention of citing in the theorem statement the source from which I took it: if nothing is cited it means I couldn't find it anywhere. Finally, I must apologize for the many things I have omitted. It seemed better to stick to a single thread than to try to compile a compendium of results in the assigned papers, and of course much that is interesting is lost.

2 Basics

The set of real valued functions on $\{0,1\}^n$ is a $2^n$ dimensional vector space over the reals. Choosing the "right" basis for this vector space yields the Fourier series.
Here I'll go through the basics, as briefly as possible. I'll give the definitions and a minimum of the rich set of properties of Fourier series, developing more as the need arises in later sections.

Notation and Conventions. I identify a string $z \in \{0,1\}^n$ with the subset of $\{1,\ldots,n\}$ whose characteristic vector is $z$. With this convention, I will apply set-theoretic notation to strings, using expressions like "$i \in z$" or "$y \cap z$" where $y, z$ are $n$ bit strings and $1 \le i \le n$. The empty string is denoted $\lambda$. We adopt the (usual) convention that $\{0,1\}^0 = \{\lambda\}$. Boolean functions for us mean functions whose range is $\{-1,+1\}$.

Fourier Series. We define the inner product of $f, g : \{0,1\}^n \to \mathbf{R}$ by $\langle f, g \rangle = 2^{-n} \sum_{x \in \{0,1\}^n} f(x)g(x)$. Note that $\langle f, g \rangle = \mathrm{E}[fg]$ where the expectation is over the uniform distribution on the inputs. The norm associated to this inner product is $\|f\| = \sqrt{\langle f, f \rangle}$.

The parity functions are the functions $\chi_z : \{0,1\}^n \to \{-1,+1\}$ defined for $z \in \{0,1\}^n$ by $\chi_z(x) = \prod_{i \in z} (-1)^{x_i}$. The family $\{\chi_z\}_{z \in \{0,1\}^n}$ is an orthonormal basis for the vector space of real valued functions on $\{0,1\}^n$ (the orthonormality is with respect to the inner product defined above).

Suppose $f : \{0,1\}^n \to \mathbf{R}$. Then $f = \sum_{z \in \{0,1\}^n} \hat{f}(z)\chi_z$ where $\hat{f}(z) = \langle f, \chi_z \rangle$. This (unique) expansion of $f$ in terms of the parity basis is its Fourier series. The Fourier coefficients of $f$ are the real numbers $\hat{f}(z)$ ($z \in \{0,1\}^n$). The sequence of Fourier coefficients is also called the spectrum of $f$. The Fourier transform is the operator $\mathcal{F}$ defined by $\mathcal{F}(f) = \hat{f}$. This operator is linear: $\mathcal{F}(af + g) = a\mathcal{F}(f) + \mathcal{F}(g)$ for real $a$ and functions $f, g : \{0,1\}^n \to \mathbf{R}$.

The spectral norm (or just norm) of $f$ is $L(f) = \sum_{z \in \{0,1\}^n} |\hat{f}(z)|$. In the context of norms, "small" means polynomially bounded (as a function of $n$) and "large" means not small.

Parseval's identity $\sum_{z \in \{0,1\}^n} \hat{f}(z)^2 = \|f\|^2$ follows directly from the orthonormality of the basis.
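To make these definitions concrete, here is a small Python sketch (all names my own) that computes Fourier coefficients and the spectral norm directly from the definitions. It enumerates all of $\{0,1\}^n$, so it is exponential-time and intended only for tiny examples.

```python
# Fourier coefficients f_hat(z) = <f, chi_z> under the uniform distribution,
# and the spectral norm L(f) = sum_z |f_hat(z)|, computed by brute force.
from itertools import product

def chi(z, x):
    # parity basis function chi_z(x) = prod_{i in z} (-1)^{x_i}
    return (-1) ** sum(zi * xi for zi, xi in zip(z, x))

def fourier_coeff(f, z, n):
    # f_hat(z) = 2^{-n} sum_x f(x) chi_z(x)
    return sum(f(x) * chi(z, x) for x in product((0, 1), repeat=n)) / 2 ** n

def spectral_norm(f, n):
    # L(f) = sum_z |f_hat(z)|
    return sum(abs(fourier_coeff(f, z, n)) for z in product((0, 1), repeat=n))

# Example: a parity function has a single coefficient of 1, so L = 1.
parity = lambda x: (-1) ** (x[0] + x[1])
print(spectral_norm(parity, 2))  # 1.0
```

The example also illustrates Parseval's identity: for this boolean function the squared coefficients sum to $1$, all of it concentrated on the single string $z = 11$.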
Parseval's identity implies that if $f$ is boolean then $\sum_{z \in \{0,1\}^n} \hat{f}(z)^2 = \|f\|^2 = 1$. A consequence of this last fact (via the Cauchy-Schwarz inequality, $\sum_z |\hat{f}(z)| \le 2^{n/2}(\sum_z \hat{f}(z)^2)^{1/2}$) is the following

Proposition 2.1 If $f$ is boolean then $L(f) \le 2^{n/2}$.

A useful property of the parity functions is that $\chi_a \chi_b = \chi_{a \oplus b}$.

Some Intuition for the Boolean Case. Some intuition about what the Fourier coefficients of a boolean function $f$ measure is provided by Linial, Mansour and Nisan [LMN]. They observe that $\hat{f}(z) = \Pr[f(x) = \chi_z(x)] - \Pr[f(x) \ne \chi_z(x)]$ (the probabilities being over a random choice of the input $x$). Thus one can think of the Fourier coefficient $\hat{f}(z)$ as measuring the ability of $f$ to approximate the parity of the bits in $z$ when the input is chosen at random.

Matrix Formulation. The following formulation of the Fourier transform in terms of matrices is often useful. The Sylvester type Hadamard matrices are defined inductively by $H_0 = [1]$ and
$$H_{n+1} = \begin{pmatrix} H_n & H_n \\ H_n & -H_n \end{pmatrix}.$$
Let $[f]$ denote the vector (of length $2^n$) whose components are the values of $f(x)$ in lexicographic order. Then

Proposition 2.2 [Br] Let $f : \{0,1\}^n \to \mathbf{R}$. Then $[\hat{f}] = 2^{-n} H_n [f]$.

3 Two Motivating Results

Fourier analysis has of late proved to be a very useful tool in learning and computational complexity theory. And in many recent results, the spectral norm plays a pivotal role. Here are two examples.

3.1 Efficient Learnability and the Spectral Norm

Learning is hard. How hard? The spectral norm of the target function, say Kushilevitz and Mansour [KM], gives us some indication. Let us see how.

The Model.
The learning algorithm $A$ receives as input $1^n, \epsilon, \delta$ and, as an oracle, the function $f : \{0,1\}^n \to \{-1,+1\}$ which it is trying to learn (the latter means that as part of its normal operation the algorithm can query the value of $f$ on an input $x$ and receive the answer in one step). The output of $A$ on these inputs is (an encoding of) a hypothesis $h : \{0,1\}^n \to \mathbf{R}$. The error of the hypothesis with respect to the input $f$ is $\mathrm{Err}_h(f) = \Pr[f(x) \ne h(x)]$ (the probability is over a random choice of $x$). We can now say what it means to learn a concept class.

Definition 3.1 For each $n$ let $C^n$ be a collection of functions from $\{0,1\}^n \to \{-1,+1\}$. We say that the concept class $C = \bigcup_{n \ge 1} C^n$ is learnable if there exists an algorithm $A$ (of the kind described above) such that $\Pr[\mathrm{Err}_{A(1^n,\epsilon,\delta;f)}(f) \le \epsilon] \ge 1 - \delta$ for every $f \in C^n$, every $n$, and every $\epsilon, \delta > 0$ (the probability is over the coin tosses of the algorithm).

Note that this is not distribution free learning in the sense of Valiant [Va]. Not only is the algorithm allowed membership queries but also the error of the hypothesis is measured with respect to the uniform distribution on the inputs.

The Result. Let $C_B = \bigcup_{n \ge 1} C_B^n$ where $C_B^n$ is the class of functions from $\{0,1\}^n \to \{-1,+1\}$ whose spectral norm is bounded above by $B(n)$.

Theorem 3.2 [KM] Let $B : \mathbf{N} \to \mathbf{N}$. Then the concept class $C_B$ is learnable in time polynomial in $B(n), n, \epsilon^{-1}$ and $\log \delta^{-1}$.

Broadly speaking, this says that a concept class is learnable in time polynomial in a bound on the spectral norm of the functions in the class.

The Algorithm. The learning algorithm is presented in Figure 1:

  $A_B(1^n, \epsilon, \delta; f)$:
    $S \leftarrow \mathrm{Coef}(1^n, \lambda, \epsilon B(n)^{-1}; f)$
    $h(\cdot) \leftarrow \mathrm{sign}(\sum_{z \in S} \hat{f}(z)\chi_z)$
    return $h$

  $\mathrm{Coef}(1^n, \alpha, \theta; f)$:    { returns a superset of $\{\alpha\beta : |\hat{f}(\alpha\beta)| \ge \theta\}$ }
    if $\alpha \in \{0,1\}^n$ then return $\{\alpha\}$
    else if $\mathrm{E}[f_\alpha^2] \ge \theta^2$ then return $\mathrm{Coef}(1^n, \alpha 0, \theta; f) \cup \mathrm{Coef}(1^n, \alpha 1, \theta; f)$
    else return $\emptyset$

  Figure 1: The learning algorithm $A_B$.

Let me explain, very briefly, what is happening.
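As a concrete companion to Figure 1, here is an idealized Python rendering (names my own) of the Coef recursion, assuming exact access to the Fourier coefficients; the actual algorithm can only estimate the relevant quantities by sampling.

```python
# Idealized Coef: grow prefixes alpha, and prune any prefix whose "energy"
# E[f_alpha^2] = sum_beta f_hat(alpha beta)^2 falls below theta^2.  If
# |f_hat(z)| >= theta then every prefix of z passes the test, so the output
# is a superset of the special strings.

def coef(fhat, n, theta, alpha=()):
    # fhat: dict mapping each z in {0,1}^n (a tuple of bits) to f_hat(z)
    if len(alpha) == n:
        return {alpha}
    energy = sum(c * c for z, c in fhat.items() if z[:len(alpha)] == alpha)
    if energy < theta ** 2:
        return set()   # no extension of alpha can have |f_hat| >= theta
    return coef(fhat, n, theta, alpha + (0,)) | coef(fhat, n, theta, alpha + (1,))

# Example: one large coefficient; the prefix (0,) is pruned immediately.
fhat = {(0, 0): 0.01, (0, 1): 0.01, (1, 0): 0.0, (1, 1): 0.99}
print(sorted(coef(fhat, 2, 0.5)))  # [(1, 0), (1, 1)]
```

The pruning is what keeps the running time polynomial: since the energies of the surviving prefixes at any fixed length sum to at most $\sum_z \hat{f}(z)^2$, only a bounded number of prefixes can pass the threshold test.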
The idea is to approximate $f$ by the sum of a small number of "special" terms of its Fourier series. These special terms are those which have a high Fourier coefficient: specifically, one can show that if $S \subseteq \{0,1\}^n$ includes all strings $z$ satisfying $|\hat{f}(z)| \ge \epsilon B(n)^{-1}$ then $\sum_{z \in S} \hat{f}(z)\chi_z$ is a good approximation to $f$.

Call $z$ special if $|\hat{f}(z)| \ge \theta$. The main component of the algorithm is a subroutine Coef which finds all the special terms of $f$: $\mathrm{Coef}(1^n, \lambda, \theta; f)$ returns a set of $n$ bit strings containing all the special ones ($\lambda$ is the empty string). Coef works recursively, building up the special strings from the empty string: if $\alpha \in \{0,1\}^k$ then $\mathrm{Coef}(1^n, \alpha, \theta; f)$ returns (a superset of) the set of all special strings which have prefix $\alpha$. The routine looks at the function $f_\alpha : \{0,1\}^{n-k} \to \mathbf{R}$ defined by $f_\alpha(x) = \sum_{\beta \in \{0,1\}^{n-k}} \hat{f}(\alpha\beta)\chi_\beta(x)$: the value of $\mathrm{E}[f_\alpha^2]$ is used to prune the search and keep the running time polynomial.

Finally, let me remark that the algorithm as stated is somewhat inaccurate in that it assumes we can compute both $\mathrm{E}[f_\alpha^2]$ and $\hat{f}(z)$. The truth, however, is that both can be sufficiently well approximated for our purposes. Proofs? They await §7.2 where I will show something more general. The correctness of the above will follow as a special case.

3.2 Sparse Representations and the Spectral Norm

One of the drawbacks of the Fourier series, from a computational point of view, is that it is just too large: the series, after all, has an exponential number of terms. So we seek to approximate functions by sparser Fourier series. But how sparse a series can one get? The answer, again, is in the spectral norm. The result of Siu and Bruck [SB] that I will describe here shows the existence of approximations to a function $f$ by a Fourier series having a number of terms which is polynomial in the spectral norm of $f$. Applications abound, particularly in the area of threshold circuit complexity [Br], [HMPST], [BS], [SB].

The Result.
The theorem of Siu and Bruck [SB] improves on work in [BS]. The proof I will present here is a slightly modified version of the one in [SB]: I make the probabilistic method more explicit and use Chernoff bounds to avoid the variance calculation.

Lemma 3.3 (Chernoff Bound) Let $X_1, \ldots, X_m$ be independent, mean $0$ random variables in the range $[-1,1]$ and let $X = \sum_{i=1}^m X_i$. Then $\Pr[|X| > A] \le 2e^{-A^2/2m}$ for any $A > 0$.

Note: Siu and Bruck [SB] assume the domain of their functions is $\{-1,+1\}^n$, rather than $\{0,1\}^n$ as I assume here. So the statement of the following theorem will look a little different from theirs. It is however easy to see that the two formulations are equivalent.

Theorem 3.4 [SB] Let $f : \{0,1\}^n \to \{-1,+1\}$ and let $k > 0$. Then there exist a set $S \subseteq \{0,1\}^n$ and integers $M, w_z$ ($z \in S$) such that $|f(x) - g(x)| \le n^{-k}$ for all $x$, where $g : \{0,1\}^n \to \mathbf{R}$ is defined by $g(x) = \frac{1}{M}\sum_{z \in S} w_z \chi_z(x)$. Moreover $|S|$ and $M$ are $O(n^{2k+1} L(f)^2)$.

Proof: The proof uses the probabilistic method. Define $\Omega_i = \{0,1\}^n \cup \{\bot_i\}$ with
$$\Pr[z] = \begin{cases} |a_z|/B & \text{if } z \in \{0,1\}^n \\ 1 - \sum_{y \in \{0,1\}^n} |a_y|/B & \text{otherwise,} \end{cases}$$
where $B = \lceil L(f) \rceil$ and $a_z = \hat{f}(z)$. Our sample space will be $\Omega = \prod_{i=1}^N \Omega_i$ with the induced product measure. Now define random variables
$$Z_{x,i}(z^1, \ldots, z^N) = \begin{cases} \mathrm{sign}(a_{z^i})\chi_{z^i}(x) & \text{if } z^i \in \{0,1\}^n \\ 0 & \text{otherwise} \end{cases}$$
and let $G_x = \frac{B}{N}\sum_{i=1}^N Z_{x,i}$. Then
$$\Pr\big[|G_x - f(x)| > n^{-k}\big] = \Pr\Big[\Big|\sum_{i=1}^N Z_{x,i} - \frac{Nf(x)}{B}\Big| > \frac{Nn^{-k}}{B}\Big],$$
and noting that $\mathrm{E}[Z_{x,i}] = f(x)/B$ this is $\le 2e^{-Nn^{-2k}/2B^2}$ by Lemma 3.3. Choosing $N = 16B^2n^{2k+1}$ this is $< 2^{-n}$, and thus $\Pr[|G_x - f(x)| \le n^{-k}$ for all $x \in \{0,1\}^n] > 0$. Thus there is a point $(z^1, \ldots, z^N)$ in our sample space such that $|G_x(z^1, \ldots, z^N) - f(x)| \le n^{-k}$ for all $x \in \{0,1\}^n$, and we can conclude the proof by setting $g = \frac{B}{N}\sum_{i=1}^N \mathrm{sign}(a_{z^i})\chi_{z^i}$ (actually the sum is only over the $i$ for which $z^i$ is in $\{0,1\}^n$). ∎

The reader is referred to [SB] for applications to the construction of small depth threshold circuits.
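The sampling argument in the proof is easy to simulate. The sketch below (code and parameter choices my own; the target is majority on $3$ bits, with a much smaller $N$ than the proof's worst-case choice) draws $N$ basis functions with probability proportional to $|\hat{f}(z)|$ and averages the signed parities.

```python
# Monte-Carlo illustration of the proof of Theorem 3.4: E[Z_{x,i}] = f(x)/B,
# so the scaled average (B/N) sum_i Z_{x,i} concentrates around f(x).
import random
from itertools import product

n, N = 3, 20000
random.seed(0)
xs = list(product((0, 1), repeat=n))

def maj(x):
    return 1 if sum(x) >= 2 else -1

def chi(z, x):
    return (-1) ** sum(zi * xi for zi, xi in zip(z, x))

# exact spectrum of majority on 3 bits, and B = L(f) (here the weights
# |a_z|/B sum to exactly 1, so the "bottom" outcome never occurs)
fhat = {z: sum(maj(x) * chi(z, x) for x in xs) / 2 ** n for z in xs}
B = sum(abs(c) for c in fhat.values())

zs = random.choices(list(fhat), weights=[abs(fhat[z]) / B for z in fhat], k=N)

def g(x):
    return B / N * sum((1 if fhat[z] > 0 else -1) * chi(z, x) for z in zs)

err = max(abs(maj(x) - g(x)) for x in xs)
print(err < 0.1)  # with overwhelming probability the sparse average is close
```

Here $B = L(\mathrm{maj}_3) = 2$, and the deviation at each point is governed exactly by the Chernoff bound used in the proof.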
4 Upper Bounds: Functions of Small Norm

Convinced that functions of small norm are nice functions, the first thing we would like to do is certainly to find examples of them. In this section I present two simple examples.

AND and OR. For any $z \in \{0,1\}^n$ we define $\mathrm{AND}_z, \mathrm{OR}_z : \{0,1\}^n \to \{-1,+1\}$ by $\mathrm{AND}_z(x) = (-1)^{\bigwedge_{i \in z} x_i}$ and $\mathrm{OR}_z(x) = (-1)^{\bigvee_{i \in z} x_i}$.

Proposition 4.1 $L(\mathrm{AND}_z), L(\mathrm{OR}_z) \le 3$.

Proof: It is not too hard to compute explicitly the complete spectrum of these functions. For example, writing $k = |z|$ and $x_i = \frac{1}{2}(1 - \chi_{\{i\}}(x))$, one gets $\mathrm{AND}_z = 1 - 2\prod_{i \in z} x_i = 1 - 2^{1-k}\sum_{y \subseteq z}(-1)^{|y|}\chi_y$, whence $L(\mathrm{AND}_z) \le 1 + 2^k \cdot 2^{1-k} = 3$. The computation for $\mathrm{OR}_z$ is similar. ∎

Comparison. Sometimes it is possible to show that a function has small norm by exploiting some intrinsic property of it (for example, a convenient inductive structure). The comparison function is an example [SB]. The comparison function $C_n : \{0,1\}^n \to \{-1,+1\}$ is defined for even $n$ by
$$C_n(xy) = \begin{cases} 1 & \text{if } x \ge y \\ -1 & \text{otherwise,} \end{cases}$$
where $x, y \in \{0,1\}^{n/2}$ are identified with the integers $\sum_{i=1}^{n/2} x_i 2^{n/2-i}$ and $\sum_{i=1}^{n/2} y_i 2^{n/2-i}$ respectively.

Proposition 4.2 [SB] Let $n$ be even. Then $L(C_n) = \frac{n}{2} + 1$.

Proof: The proof is by induction on $k = n/2$. For the base case $k = 1$ it is easily verified that $C_2 = \frac{1}{2}\chi_{00} + \frac{1}{2}\chi_{01} - \frac{1}{2}\chi_{10} + \frac{1}{2}\chi_{11}$ and hence $L(C_2) = 2$. For $k \ge 1$ we note that
$$C_{2k+2}(xy) = \frac{(-1)^{y_1} - (-1)^{x_1}}{2} + \frac{1 + (-1)^{x_1+y_1}}{2}\, C_{2k}(x_2 \ldots x_{k+1}\, y_2 \ldots y_{k+1}).$$
It follows that $L(C_{2k+2}) = L(C_{2k}) + 1$ (the terms above have disjoint Fourier supports, so there is no cancellation). ∎

5 The Spectral Norm of Decision Trees

A particularly interesting aspect of spectral norms is how they can relate to the computational complexity of the function in some model of computation. The first (to my knowledge) example is the result of [KM] on parity decision trees. Here I present a general framework for upper bounding the spectral norm of an arbitrary decision tree in terms of the spectral norms of the functions in its nodes. Extending the result of [KM], this yields many examples of classes of decision trees of small norm.

5.1 A General Framework

The Model. A decision tree is a binary tree in which each node is labeled by a boolean function $\{0,1\}^n \to \{-1,+1\}$.
The tree computes (or defines) a function $f : \{0,1\}^n \to \{-1,+1\}$ in the natural way. On any input $x \in \{0,1\}^n$, we start at the root and compute the value at $x$ of the function $g$ by which the root is labeled. We branch according to this value: to the left if it was $-1$ and to the right if it was $+1$. We continue in this way until we hit a leaf. Here we compute the value at $x$ of the function $g$ by which this leaf is labeled, and this value is the value of the function $f$ on input $x$. Note that we do not for now restrict the functions labeling the nodes: any boolean function is allowed.

I will identify the decision tree with the function it computes. Thus I will speak of the Fourier transform or the spectral norm of a decision tree, and write things like $\hat{T}(z)$ or $L(T)$ correspondingly. Similarly I identify a node with the function which labels it and speak of the Fourier transform or norm of a node. Finally, $|T|$ will denote the number of nodes in $T$.

Degrees. We assign to each function a number which we will call its degree. The degrees of the constituent nodes will enable us to measure the spectral norm of a decision tree. The definition itself is very simple: we let $\deg(g) = \frac{1}{2}[L(g) + 1]$. So the degrees are, strictly speaking, the spectral norms themselves: it is just more convenient to translate slightly. Note that $L(g) \ge 1$ for any boolean $g$ by Proposition 6.1. So $1$ is also the minimal degree of any boolean function.

Weight of a Decision Tree. We will assign to each decision tree a weight which is computed as a function of the spectral norms (or, more precisely, the degrees) of its constituent nodes. The weight $w(T)$ of a decision tree $T$ is defined by induction on the structure of the binary tree $T$ as follows:

• If $|T| = 1$ then we let $w(T) = L(g)$ where $g$ is the function labeling the (single) node which comprises the tree.

• If $|T| \ge 3$ then $T$ is as depicted in Figure 2 and we let $w(T) = \deg(g)[w(T_{-1}) + w(T_1)]$.

Useful Facts.
Before we can prove the main theorem we need to develop some facts about the spectral norm.

Lemma 5.1 $L(f\chi_s) = L(f)$ for any function $f$.

Proof: We have $\widehat{f\chi_s}(z) = 2^{-n}\sum_x f(x)\chi_s(x)\chi_z(x) = 2^{-n}\sum_x f(x)\chi_{s \oplus z}(x) = \hat{f}(s \oplus z)$, and the lemma follows from the fact that the map $z \mapsto s \oplus z$ is a permutation of $\{0,1\}^n$. ∎

Figure 2: The decision tree $T$: the root is labeled $g$, and $T_{-1}$ and $T_1$ are its left and right subtrees.

Lemma 5.2 Suppose $f, g : \{0,1\}^n \to \mathbf{R}$. Then (1) $L(f+g) \le L(f) + L(g)$ and (2) $L(fg) \le L(f)L(g)$.

Proof: By the linearity of the Fourier transform we have $\mathcal{F}(f+g)(z) = \mathcal{F}(f)(z) + \mathcal{F}(g)(z)$. Apply the triangle inequality and then sum over all $z$ to get (1). Let $a_z = \hat{g}(z)$. Then making use of (1) we have $L(fg) = L(f \cdot \sum_z a_z\chi_z) = L(\sum_z a_z f\chi_z) \le \sum_z L(a_z f\chi_z) = \sum_z |a_z|\,L(f\chi_z)$. But by Lemma 5.1 this equals $\sum_z |a_z|\,L(f) = L(f)\sum_z |a_z| = L(f)L(g)$, which establishes (2). ∎

Main Theorem. The main theorem is now very simply stated. It says that the spectral norm of a decision tree is at most its weight.

Theorem 5.3 Let $T$ be a decision tree. Then $L(T) \le w(T)$.

Proof: The proof is by induction on the structure of the tree. The base case of $T$ having only one node is easily verified. Now suppose $T$ has $\ge 3$ nodes. Visualize $T$ as depicted in Figure 2. Now observe that
$$T(x) = \tfrac{1}{2}[1-g(x)]T_{-1}(x) + \tfrac{1}{2}[1+g(x)]T_1(x) = \tfrac{1}{2}T_{-1}(x) + \tfrac{1}{2}T_1(x) - \tfrac{1}{2}T_{-1}(x)g(x) + \tfrac{1}{2}T_1(x)g(x).$$
By Lemma 5.2 we get
$$L(T) \le \tfrac{1}{2}L(T_{-1}) + \tfrac{1}{2}L(T_1) + \tfrac{1}{2}L(T_{-1})L(g) + \tfrac{1}{2}L(T_1)L(g) = \tfrac{1}{2}[L(g)+1]\,L(T_{-1}) + \tfrac{1}{2}[L(g)+1]\,L(T_1) = \deg(g)[L(T_{-1}) + L(T_1)].$$
But by induction $L(T_{-1}) \le w(T_{-1})$ and $L(T_1) \le w(T_1)$, so the above is $\le \deg(g)[w(T_{-1}) + w(T_1)] = w(T)$. ∎

5.2 Some Applications: Decision Trees of Small Norm

Usually, we are interested in decision trees all of whose nodes are from some small basis of functions. For example, [KM] consider decision trees where the nodes all compute parity functions. In many of these special cases, the weight of the tree is easy to compute. Theorem 5.3 can then be applied to obtain a useful upper bound on the spectral norm of the tree.
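As a small sanity check of Theorem 5.3, here is a Python sketch (tree representation and names my own) that computes $w(T)$ from the recursive definition and compares it against the exact spectral norm of the function a tiny tree computes.

```python
# A tree is ('leaf', g) or ('node', g, T_left, T_right), where g maps
# inputs (here, integers encoding bit strings) to {-1,+1};
# deg(g) = (L(g) + 1) / 2, and w follows the two-case definition above.

def L(g, n):
    # exact spectral norm of g, computed by brute force
    total = 0.0
    for z in range(2 ** n):
        c = sum(g(x) * (-1) ** bin(z & x).count("1") for x in range(2 ** n))
        total += abs(c) / 2 ** n
    return total

def weight(T, n):
    if T[0] == "leaf":
        return L(T[1], n)
    deg = (L(T[1], n) + 1) / 2
    return deg * (weight(T[2], n) + weight(T[3], n))

def evaluate(T, x):
    if T[0] == "leaf":
        return T[1](x)
    return evaluate(T[2] if T[1](x) == -1 else T[3], x)

# A 3-node parity decision tree: root queries the parity of the low bit,
# leaves are the constants -1 and +1.
chi1 = lambda x: (-1) ** (x & 1)
tree = ("node", chi1, ("leaf", lambda x: -1), ("leaf", lambda x: 1))
print(L(lambda x: evaluate(tree, x), 2), "<=", weight(tree, 2))  # 1.0 <= 2.0
```

Here the tree happens to compute a parity, so $L(T) = 1$, while $w(T) = 1 \cdot (1+1) = 2 = \lceil |T|/2 \rceil$, matching the parity-tree bound derived next.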
This technique is quite general, and one can use it to bound the spectral norm of trees with all kinds of combinations of functions in the nodes. Here I illustrate with just two examples.

Parity Decision Trees. Let us begin by deriving the result of [KM] that the spectral norm of a parity decision tree is bounded above by the number of nodes in the tree (a parity decision tree is a decision tree each of whose internal nodes is labeled by a parity function $\chi_s$ and each of whose leaves is labeled $+1$ or $-1$).

Corollary 5.4 [KM] Let $T$ be a parity decision tree. Then $L(T) \le \lceil |T|/2 \rceil$.

Proof: By the main theorem it suffices to show that the weight of a parity decision tree $T$ is $\le \lceil |T|/2 \rceil$. The important observation for this is that $\deg(\chi_s) = 1$. The fact that $w(T) \le \lceil |T|/2 \rceil$ is now easily established by induction using the definition of $w$. ∎

Characterization of Degree 1 Functions. We note that the only fact about the parity functions that we used in the above is that $\deg(\chi_s) = 1$. Thus the theorem extends to decision trees each of whose nodes is labeled by a function of degree $1$. As the following characterization of degree $1$ functions shows, however, these decision trees are actually identical to parity decision trees.

Proposition 5.5 Suppose $g$ is boolean. Then $\deg(g) = 1$ if and only if $L(g) = 1$ if and only if $g = \pm\chi_s$ for some $s$.

Proof: It is clear that $\deg(g) = 1$ if and only if $L(g) = 1$, and it is also clear that $L(g) = 1$ if $g = \pm\chi_s$. To complete the proof it suffices to show that if $L(g) = 1$ then $g = \pm\chi_s$ for some $s$. So suppose $L(g) = 1$. Then for any $x$ we have $1 = |g(x)| = |\sum_z \hat{g}(z)\chi_z(x)| \le \sum_z |\hat{g}(z)| = L(g) = 1$, and thus $|\sum_z \hat{g}(z)\chi_z(x)| = \sum_z |\hat{g}(z)\chi_z(x)|$. It follows that for each $x$ either $\hat{g}(z)\chi_z(x) \ge 0$ for all $z$ or $\hat{g}(z)\chi_z(x) \le 0$ for all $z$.
Moreover the former happens when $g(x) = 1$ and the latter when $g(x) = -1$, because $g(x) = \sum_z \hat{g}(z)\chi_z(x)$. So fixing any $z$ such that $\hat{g}(z) \ne 0$ (at least one such must exist) we can conclude that $\mathrm{sign}(\hat{g}(z)) = \chi_z(x)$ if $g(x) = 1$ and $\mathrm{sign}(\hat{g}(z)) = -\chi_z(x)$ if $g(x) = -1$. But this is true for all $x$, and thus $g$ is either $\chi_z$ or $-\chi_z$. ∎

So the parity functions and their negations are exactly the boolean functions of minimal degree.

Computing the Weight. Before proceeding to more applications let us derive a more useful formulation of the weight function. Namely,
$$w(T) = \sum_{\text{leaves } \ell}\; L(\ell) \prod_{\text{ancestors } i \text{ of } \ell} \deg(i).$$
That is, the weight is the sum, over all leaf to root paths, of the norm of the leaf times the product of the degrees of the internal nodes along the path (we are identifying a node with the function labeling it, so that $\deg(i)$ means the degree of the function labeling node $i$).

AND and OR. We looked at the spectral norm of the AND and OR functions in §4, and by Proposition 4.1 we know that $\deg(\mathrm{AND}_z), \deg(\mathrm{OR}_z) \le 2$. So these functions are just a step higher than parity in the degree hierarchy, and the natural choice of an addition to the basis to get a richer class of decision trees.

Corollary 5.6 Let $T$ be a decision tree over the basis $\{\chi_s, \mathrm{AND}_s, \mathrm{OR}_s : s \in \{0,1\}^n\}$ with the property that any root to leaf path has at most $O(\log n)$ $\mathrm{AND}_s$ or $\mathrm{OR}_s$ nodes. Then $L(T) \le |T| \cdot n^{O(1)}$.

Proof: The weight is easily bounded using the above expression. ∎

I don't have space for more examples, but these should have sufficed to illustrate quite well how the main theorem can be applied.

6 Lower Bounds: Functions of Large Norm

I now turn to lower bounds on the spectral norm. I'll first (briefly) consider arbitrary real valued functions and then move on to the real case of interest: boolean functions.

6.1 General Functions

Let's begin with the following observation.

Proposition 6.1 $L(f) \ge \max_x |f(x)|$ for any function $f : \{0,1\}^n \to \mathbf{R}$.

Proof: For any $x$ it is the case that $L(f) = \sum_z |\hat{f}(z)| = \sum_z |\hat{f}(z)\chi_z(x)| \ge |\sum_z \hat{f}(z)\chi_z(x)| = |f(x)|$.
∎

This means that for a function to have polynomially bounded spectral norm, its output must be at most $O(\log n)$ bits long. Many simple and interesting functions are thus ruled out. Addition of two $n$ bit numbers is one such. One quickly realizes, however, that the above is not really a restriction. In most applications (learning is an example) the output can be viewed bit by bit. That is, we view the function as a concatenation of boolean functions, and consider the spectral norm of each of these boolean functions separately. So we should concentrate on boolean functions.

6.2 Boolean Functions

It is probably not surprising that

Theorem 6.2 Most boolean functions have large spectral norm.

We can view this as a corollary of the fact that functions of small norm are efficiently learnable (Theorem 3.2). A direct proof is not as obvious as it might seem (and I haven't seen one).

Can we construct a specific example of a boolean function of large norm? The answer is yes. The function is the "inner product mod 2" function, defined for even $n$ by
$$I_n(x) = \prod_{i=1}^{n/2} (-1)^{x_i \wedge x_{n/2+i}}.$$

Theorem 6.3 [MS], [Br] Let $n$ be even. Then $L(I_n) = 2^{n/2}$.

Proof: This proof uses the matrix representation of Fourier transforms that we introduced in §2. We claim that $[I_n]$ is an eigenvector of $H_n$ with eigenvalue $2^{n/2}$. Proposition 2.2 then implies that $[\hat{I}_n] = 2^{-n/2}[I_n]$, and the result follows. The fact that $[I_n]$ is an eigenvector of $H_n$ with eigenvalue $2^{n/2}$ is easily established by induction on $k = n/2$. ∎

This proof is from [Br], who however cites [MS] for the result. We note that by Proposition 2.1 the bound of Theorem 6.3 is tight. A few other simple examples of functions of exponential spectral norm are known [Br], [BS]. The proofs in all cases are based on inductive ideas like the ones above (and indeed the functions are constructed to have enough inductive structure to make these proofs go through).
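Theorem 6.3 and the matrix formulation of Proposition 2.2 are easy to check numerically for small $n$; the Python sketch below (code my own) builds the Sylvester Hadamard matrix, verifies the eigenvector claim, and reads off the spectral norm.

```python
# Check that [I_n] is an eigenvector of H_n with eigenvalue 2^{n/2}, and
# hence (Proposition 2.2) that L(I_n) = sum |2^{-n} H_n [f]| = 2^{n/2}.

def hadamard(n):
    # H_0 = [1], H_{m+1} = [[H_m, H_m], [H_m, -H_m]]
    H = [[1]]
    for _ in range(n):
        H = [row + row for row in H] + [row + [-v for v in row] for row in H]
    return H

def inner_product_fn(n):
    # lexicographic truth table of I_n(x) = prod_i (-1)^{x_i x_{n/2+i}}
    k = n // 2
    vals = []
    for x in range(2 ** n):
        bits = [(x >> (n - 1 - i)) & 1 for i in range(n)]
        vals.append((-1) ** (sum(bits[i] * bits[k + i] for i in range(k)) % 2))
    return vals

for n in (2, 4):
    f = inner_product_fn(n)
    H = hadamard(n)
    Hf = [sum(H[i][j] * f[j] for j in range(2 ** n)) for i in range(2 ** n)]
    assert Hf == [2 ** (n // 2) * v for v in f]      # eigenvector claim
    print(n, sum(abs(c) for c in Hf) / 2 ** n)       # L(I_n) = 2^{n/2}
```

Note that the eigenvector property does not depend on how the $n/2$ coordinate pairs are interleaved, since the transform factors over disjoint pairs.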
I'll now consider a somewhat different kind of boolean function: majority.

6.3 An Exponential Lower Bound for Majority

The majority function $M_n : \{0,1\}^n \to \{-1,+1\}$ is defined by
$$M_n(x) = \begin{cases} 1 & \text{if } \sum_{i=1}^n x_i \ge \frac{n}{2} \\ -1 & \text{otherwise.} \end{cases}$$

What's Known. Siu and Bruck [SB] show that the spectral norm of $M_n$ is not polynomially bounded. Their proof relies on Theorem 3.4 and results from [Br] and [HMPST]. The results from [Br] and [HMPST] are first used to show that $M_n$ is not approximable (to within $o(n^{-1})$ error) by any sparse Fourier series. That $L(M_n)$ is not polynomially bounded then follows from Theorem 3.4.

Assuming this was the best known ([SB] did not indicate otherwise) I proved the exponential lower bound which appears as Theorem 6.4 below. Later I read [BOH] in which, in a different context, appear equations for the complete spectrum of the majority function. A lower bound at least as good as Theorem 6.4 is easily derived from this. They attribute the spectrum to [Ka]. I was however unable to find [Ka] anywhere, and thus do not know the proof of the result stated in [BOH]. I should note, though, that [SB] and [BOH] consider only the case of $n$ odd (they are interested in majority as a linear threshold function, and the case of $n$ even is inconvenient because $\sum_i x_i$ could fall exactly on the threshold). My proof applies (only) to $n$ even.

Before going on it is also worth noting that techniques along the lines of those used in the proof of Theorem 6.3 would not appear to be useful here, because the majority function has no clear inductive structure.

The Lower Bound. The following exponential lower bound is within a polynomial factor of optimal:

Theorem 6.4 Let $n \ge 1$ be a multiple of $4$. Then
$$L(M_n) \ge \frac{4e^{-1/n}}{\sqrt{2}\,\pi n}\, 2^{n/2}.$$

Let's proceed to the proof. The attack is pretty direct. What I will do is compute explicitly the "middle" Fourier coefficients (that is, those corresponding to the parity over a size $n/2$ subset of the variables), and get a bound based on these alone.

A Combinatorial Identity.
I will need the fact that
$$\sum_{i=0}^{2k} (-1)^i \binom{2k}{i}^2 = (-1)^k \binom{2k}{k}.$$
This is probably known, although I haven't been able to find a reference. For completeness, I provide in Appendix A a simple proof via generating functions.

Computing the Middle Fourier Coefficients. The following lemma gives explicit values for the middle Fourier coefficients of the majority function.

Lemma 6.5 Let $z \in \{0,1\}^{4k}$ have exactly $2k$ ones. Then $\langle M_{4k}, \chi_z \rangle = 2^{-4k}\binom{2k}{k}(-1)^k$.

Proof: We begin by observing that the value of the Fourier coefficient $\langle M_{4k}, \chi_z \rangle$ depends only on the number of ones in $z$. So without loss of generality we can set $z = 1^{2k}0^{2k}$. Now let $A = 2^{4k}\langle M_{4k}, \chi_z \rangle$. We claim that
$$A = \sum_{i=0}^{2k} \binom{2k}{i}^2 (-1)^i, \qquad (1)$$
whence the result follows by our combinatorial identity.

To establish Equation (1) we begin by noting that $M_{4k}(x)\chi_z(x) = (-1)^i M_{4k}(1^i 0^{2k-i} 1^j 0^{2k-j})$, where $i$ is the number of ones in the left half of $x$ and $j$ is the number of ones in the right half of $x$. It follows that
$$A = \sum_{i=0}^{2k}\sum_{j=0}^{2k} \binom{2k}{i}\binom{2k}{j}(-1)^i\, M_{4k}(1^i 0^{2k-i} 1^j 0^{2k-j}) = \sum_{i=0}^{2k}(-1)^i\binom{2k}{i}\Big[\sum_{j=2k-i}^{2k}\binom{2k}{j} - \sum_{j=0}^{2k-i-1}\binom{2k}{j}\Big].$$
Using the symmetry $\binom{2k}{j} = \binom{2k}{2k-j}$, the term in square brackets equals $2\sum_{j=0}^{i}\binom{2k}{j} - 2^{2k}$, and since $\sum_{i=0}^{2k}(-1)^i\binom{2k}{i} = 0$ the $2^{2k}$ part contributes nothing. So $A = 2S$ where $S = \sum_{i=0}^{2k}(-1)^i\binom{2k}{i}\sum_{j=0}^{i}\binom{2k}{j}$. Substituting $i \to 2k-i$ and again using the symmetry of the binomial coefficients,
$$S = \sum_{i=0}^{2k}(-1)^i\binom{2k}{i}\sum_{j=0}^{2k-i}\binom{2k}{j} = \sum_{i=0}^{2k}(-1)^i\binom{2k}{i}\Big[2^{2k} - \sum_{j=0}^{i-1}\binom{2k}{j}\Big] = -S + \sum_{i=0}^{2k}(-1)^i\binom{2k}{i}^2.$$
Hence $2S = \sum_{i=0}^{2k}(-1)^i\binom{2k}{i}^2$, and putting all this together we get Equation (1) as desired. ∎

Proof of Theorem 6.4. The proof of Theorem 6.4 follows easily now. First let us recall the bound
$$\binom{2n}{n} \ge \frac{e^{-1/6n}}{\sqrt{\pi n}}\, 2^{2n}, \qquad (2)$$
which is obtained using Stirling's formula. Now let $k = n/4$. Then $L(M_{4k}) = \sum_z |\langle M_{4k}, \chi_z \rangle| \ge \sum_{|z|=2k} |\langle M_{4k}, \chi_z \rangle|$, and by Lemma 6.5 this equals $\binom{4k}{2k} \cdot 2^{-4k}\binom{2k}{k}$, which by Equation (2) is at least
$$\frac{e^{-1/12k}}{\sqrt{2\pi k}}\, 2^{4k} \cdot 2^{-4k} \cdot \frac{e^{-1/6k}}{\sqrt{\pi k}}\, 2^{2k} = \frac{4e^{-1/n}}{\sqrt{2}\,\pi n}\, 2^{n/2}$$
as desired. ∎

7 Generalization: q-norms

I now turn to a generalization of the basic Fourier series and the corresponding spectral norms (q-norms).
Results about the q-norm being much scarcer in the literature (it is a recent invention of Furst, Jackson and Smith [FJS]), I will spend more time here on developing techniques and applications. In particular, I present a new application.

The motivation, once again, comes from learning. One of the drawbacks of the [KM] learning algorithm is that the quality of the hypothesis is measured by its proximity to the target under one particular fixed distribution: the uniform one. Can we move any closer to some kind of distribution free learning? History, luckily, provides a parallel. Linial, Mansour and Nisan [LMN] had used Fourier analysis to develop an algorithm to learn AC^0 functions under the uniform distribution. Furst, Jackson and Smith extended this result to what they called mutually independent distributions. In order to do this they developed a more general Fourier transform in which the basis is defined in terms of the mutually independent distribution $q$. The tools and framework introduced by [FJS] provide an avenue to similarly generalize the [KM] learning algorithm. Measuring the quality of a hypothesis by its proximity to the target under $q$, we will be able to learn in time polynomial in the q-norm. Here I will present this generalization of [KM] and then try to say something about q-norms and their relation to the usual norm.

7.1 The q-Framework

I must begin by developing the machinery. Most of the work, in fact, is here: once things have been properly set up the lemmas of [KM] can be generalized in a pretty straightforward manner.

Mutually Independent Distributions. A probability distribution $q : \{0,1\}^n \to [0,1]$ is mutually independent if the random variables $x_1, \ldots, x_n$ are independent. We will be assuming that $0 < \Pr[x_i = 1] < 1$ for each $i$. We say that $q$ is polynomially bounded if
$$\max_{1 \le i \le n}\Big\{\frac{1}{\Pr[x_i = 1]},\ \frac{1}{\Pr[x_i = 0]}\Big\}$$
is polynomially bounded (as a function of $n$).

q-Fourier Series. Let $q$ be mutually independent.
We define the inner product of $f, g : \{0,1\}^n \to R$ by

$\langle f, g \rangle_q = \sum_{x \in \{0,1\}^n} q(x) f(x) g(x)$.

Note that $\langle f, g \rangle_q = E[fg]$ where the expectation is over $q$. The norm associated to this inner product is $\|f\|_q = \sqrt{\langle f, f \rangle_q}$.

Let $\mu_i = \Pr[x_i = 1]$. Note that $\mu_i$ is the mean of the random variable $x_i$. Let the standard deviation $\sqrt{\mu_i(1-\mu_i)}$ of $x_i$ be denoted by $\sigma_i$. For $z, x \in \{0,1\}^n$ we define

$\phi(z,x) = \prod_{i \in z} \frac{\mu_i - x_i}{\sigma_i}$.

We call $\phi$ the basis associated to $q$ because of the following

Proposition 7.1 [Ba],[FJS] Let $q : \{0,1\}^n \to [0,1]$ be mutually independent and let $\phi$ be as defined above. Then $\{\phi(z,\cdot)\}_{z \in \{0,1\}^n}$ is an orthonormal basis for the vector space of real valued functions on $\{0,1\}^n$ (the orthonormality is with respect to the inner product $\langle \cdot, \cdot \rangle_q$).

Note that $\phi(z,\cdot) = \chi_z$ when $q$ is the uniform distribution.

To get a feel for these definitions, let's check the orthonormality. For $a, b \in \{0,1\}^n$ we have

$\langle \phi(a,\cdot), \phi(b,\cdot) \rangle_q = E\left[ \prod_{i \in a} \frac{\mu_i - x_i}{\sigma_i} \prod_{j \in b} \frac{\mu_j - x_j}{\sigma_j} \right]$,

and the mutual independence implies that this equals

$\prod_{i \in a \cap b} E\left[ \frac{(\mu_i - x_i)^2}{\sigma_i^2} \right] \cdot \prod_{i \in a \triangle b} E\left[ \frac{\mu_i - x_i}{\sigma_i} \right]$.

Now each term of the first product is 1 while each term of the second is 0, and thus the above expression is 1 if $a = b$ and 0 otherwise.

The Fourier series of $f : \{0,1\}^n \to R$ is $\sum_{z \in \{0,1\}^n} \hat{f}(z)\phi(z,x)$ where $\hat{f}(z) = \langle f, \phi(z,\cdot) \rangle_q$. The q-spectral norm (or just q-norm) is $L_q(f) = \sum_{z \in \{0,1\}^n} |\hat{f}(z)|$. Parseval's identity as usual says that $\sum_{z \in \{0,1\}^n} \hat{f}(z)^2 = \|f\|_q^2$. I'm abusing notation a little by denoting by $\hat{f}(z)$ both the Fourier coefficients under $q$ and the (usual) Fourier coefficients with respect to $\{\chi_z\}$. The context should usually make it clear which I mean, and if both are around I will disambiguate.

The Shifted Distribution and Basis. With minor changes the above is the framework of [FJS]. I now have to go a little further.

Let $q$ be mutually independent and $\phi$ the associated basis. Suppose $1 \le k \le n$. For $y \in \{0,1\}^k$ and $0 \le j \le n-k$ we define $q_j(y) = \Pr[x_{j+1} = y_1, \ldots, x_{j+k} = y_k]$.
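The orthonormality computation of Proposition 7.1 can also be checked by brute force for a small $n$; a Python sketch under the definitions above (the helper names are mine):

```python
from itertools import product
from math import prod, sqrt

mu = [0.75, 0.4, 0.6]                     # mu[i] = Pr[x_i = 1], each in (0,1)
sigma = [sqrt(m * (1 - m)) for m in mu]   # standard deviation of x_i

def q(x):
    """Mass of the bit string x under the mutually independent distribution."""
    return prod(m if xi else 1 - m for xi, m in zip(x, mu))

def phi(z, x):
    """Basis function phi(z, x) = product over i in z of (mu_i - x_i) / sigma_i."""
    return prod((mu[i] - x[i]) / sigma[i] for i in range(len(mu)) if z[i])

def inner(f, g):
    """Inner product <f, g>_q = sum_x q(x) f(x) g(x)."""
    return sum(q(x) * f(x) * g(x) for x in product([0, 1], repeat=len(mu)))

# Gram matrix of the basis: should be 1 on the diagonal and 0 off it
gram = {
    (a, b): inner(lambda x, a=a: phi(a, x), lambda x, b=b: phi(b, x))
    for a in product([0, 1], repeat=len(mu))
    for b in product([0, 1], repeat=len(mu))
}
```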
Observe that $q_j$ is a mutually independent distribution on $\{0,1\}^k$. For notational convenience we identify $q_0$ with $q$. For $x, z \in \{0,1\}^k$ and $0 \le j \le n-k$ we define

$\phi_j(z,x) = \prod_{i \in z} \frac{\mu_{j+i} - x_i}{\sigma_{j+i}}$.

We also define $\phi_j(\lambda,\lambda) = 1$ for all $j = 0, \ldots, n$ (here $\lambda$ is the empty string). For notational convenience we identify $\phi_0$ with $\phi$. The fact that $q_j$ is a mutually independent distribution on $\{0,1\}^k$ implies, by Proposition 7.1, that $\{\phi_j(z,\cdot)\}_{z \in \{0,1\}^k}$ is an orthonormal basis for the vector space of real valued functions on $\{0,1\}^k$ (the orthonormality is with respect to the inner product $\langle \cdot, \cdot \rangle_{q_j}$). We'll denote by $E_j$ the expectation over $q_j$.

A Switching Lemma. One very useful property of the $\phi$-basis is that

$\sqrt{q_j(a)}\,\phi_j(b,a) = \sqrt{q_j(b)}\,\phi_j(a,b)$

for any $a, b \in \{0,1\}^k$ and any $j = 0, \ldots, n-k$.

7.2 Efficient Learnability and the q-Norm

Fix a mutually independent distribution $q$. The only change in the learning model is that the probability in the definition of the error $\mathrm{Err}_h(f) = \Pr[f(x) \ne h(x)]$ is now taken over $q$ (rather than over $x$ chosen uniformly at random).

The Result. Let $C_B(q) = \bigcup_{n \ge 1} C_B^n(q)$ where $C_B^n(q)$ is the class of functions from $\{0,1\}^n \to \{-1,+1\}$ whose spectral q-norm is bounded above by $B(n)$.

Theorem 7.2 Let $B : N \to N$. Let $q$ be a polynomially bounded, mutually independent distribution. Then the concept class $C_B(q)$ is learnable in time polynomial in $B(n)$, $n$, $\epsilon^{-1}$ and $\log \delta^{-1}$.

Note that when $q$ is the uniform distribution this reduces to the [KM] result.

The algorithm is shown in Figure 3. Again, one can check that when $q$ is the uniform distribution it is just the algorithm of Figure 1. To show that it works, I now generalize the three basic lemmas of [KM]. I'll state and prove the lemmas without elaborating overly; the discussion in §3.1 should suffice to see why these lemmas imply the correctness of the algorithm, and for even more detail the reader is referred to [KM].

The Lemmas. Fix a mutually independent distribution $q$ and its associated basis $\phi$. Let $f : \{0,1\}^n \to R$ (note we are not requiring $f$ to be boolean).
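Before turning to the lemmas, a brute-force numerical check of the switching property of the $\phi$-basis, $\sqrt{q(a)}\,\phi(b,a) = \sqrt{q(b)}\,\phi(a,b)$ (the $j = 0$ case), may be reassuring; a Python sketch, with helper names of my own choosing:

```python
from itertools import product
from math import prod, sqrt

mu = [0.75, 0.3, 0.6]                     # bit probabilities of a small q
sigma = [sqrt(m * (1 - m)) for m in mu]

def q(x):
    """Mass of the bit string x under the mutually independent distribution."""
    return prod(m if xi else 1 - m for xi, m in zip(x, mu))

def phi(z, x):
    """phi(z, x) = product over i in z of (mu_i - x_i) / sigma_i."""
    return prod((mu[i] - x[i]) / sigma[i] for i in range(len(mu)) if z[i])

# switching: sqrt(q(a)) * phi(b, a) == sqrt(q(b)) * phi(a, b) for all a, b
checks = [
    abs(sqrt(q(a)) * phi(b, a) - sqrt(q(b)) * phi(a, b))
    for a in product([0, 1], repeat=len(mu))
    for b in product([0, 1], repeat=len(mu))
]
```

The identity can also be verified symbolically one coordinate at a time, since both $q$ and $\phi$ factor over the coordinates.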
Suppose $0 \le k \le n-1$ and $\alpha \in \{0,1\}^k$. Define $f_\alpha : \{0,1\}^{n-k} \to R$ by $f_\alpha(x) = \sum_{\beta \in \{0,1\}^{n-k}} \hat{f}(\alpha\beta)\,\phi_k(\beta,x)$.

The first lemma shows that it suffices to find the terms of sufficiently large Fourier coefficient. The proof remains identical to the one in [KM].

    S ← Coef(1^n, λ, εB(n)^{-1}; f)
    h(·) ← sign(Σ_{z∈S} f̂(z) φ(z,·))
    return h

    Coef(1^n, α, θ; f)
      { Returns a superset of { αβ : |f̂(αβ)| ≥ θ } }
      if α ∈ {0,1}^n then return {α}
      else
        let k be the length of α
        if E_k[f_α²] ≥ θ² then return Coef(1^n, α0, θ; f) ∪ Coef(1^n, α1, θ; f)
        else return ∅

Figure 3: The Generalized Learning Algorithm A_{B,q}

Lemma 7.3 Let $\epsilon > 0$ and suppose $S \supseteq \{z : |\hat{f}(z)| \ge \epsilon L_q(f)^{-1}\}$. Let $g(x) = \sum_{z \in S} \hat{f}(z)\phi(z,x)$. Then $E[(f-g)^2] \le \epsilon$. Moreover if $f$ is boolean then $\Pr[f(x) \ne \mathrm{sign}(g(x))] \le \epsilon$.

Proof: $E[(f-g)^2] = \sum_{z \notin S} \hat{f}(z)^2$, and the last of these is at most $\max_{z \notin S} |\hat{f}(z)| \cdot \sum_{z \in \{0,1\}^n} |\hat{f}(z)| \le \epsilon L_q(f)^{-1} \cdot L_q(f) = \epsilon$. For boolean $f$ we note that $\Pr[f(x) \ne \mathrm{sign}(g(x))] \le \Pr[|f(x) - g(x)| \ge 1] \le E[(f-g)^2]$. ■

The next lemma is used to bound the search time. I state it slightly more generally than [KM] so that the case of non-boolean functions is covered.

Lemma 7.4 At most a $\|f\|_q^2\,\theta^{-2}$ fraction of the functions $f_\alpha$ ($\alpha \in \{0,1\}^k$) have $E_k[f_\alpha^2] \ge \theta^2$.

Proof: We observe that

$\sum_\alpha E_k[f_\alpha^2] = \sum_\alpha \langle f_\alpha, f_\alpha \rangle_{q_k} = \sum_\alpha \left\langle \sum_\beta \hat{f}(\alpha\beta)\phi_k(\beta,\cdot),\; \sum_\beta \hat{f}(\alpha\beta)\phi_k(\beta,\cdot) \right\rangle_{q_k}$,

which by the bilinearity of the inner product and the orthonormality of the basis is $\sum_{\alpha\beta} \hat{f}(\alpha\beta)^2 = \|f\|_q^2$. The lemma follows. ■

Note that $\|f\|_q \le L_q(f)$ and hence if $L_q(f)$ is polynomially bounded then so is $\|f\|_q$.

Finally we have to show how to approximate $E_k[f_\alpha^2]$. It can be done by sampling provided we can approximate $f_\alpha$ itself. By equating $f_\alpha$ with an expectation the following lemma provides the avenue to obtain the necessary approximation to it.

Lemma 7.5 $f_\alpha(x) = E[f(\cdot x)\phi(\alpha,\cdot)]$ (the expectation being over $q_0$).

Proof: Making liberal use of the identities in §7.1 (in particular $q(yz) = q_0(y)q_k(z)$ and $\phi(\alpha\beta, yz) = \phi_0(\alpha,y)\,\phi_k(\beta,z)$) we have

$f_\alpha(x) = \sum_\beta \hat{f}(\alpha\beta)\,\phi_k(\beta,x) = \sum_{yz} q_0(y)\,f(yz)\,\phi_0(\alpha,y)\; q_k(z) \sum_\beta \phi_k(\beta,z)\,\phi_k(\beta,x)$.

By the switching lemma $\phi_k(\beta,z)\,\phi_k(\beta,x) = \frac{q_k(\beta)}{\sqrt{q_k(z)q_k(x)}}\,\phi_k(z,\beta)\,\phi_k(x,\beta)$, so

$q_k(z)\sum_\beta \phi_k(\beta,z)\,\phi_k(\beta,x) = \frac{q_k(z)}{\sqrt{q_k(z)q_k(x)}}\,\langle \phi_k(z,\cdot), \phi_k(x,\cdot) \rangle_{q_k}$,

which by orthonormality is 1 if $z = x$ and 0 otherwise. Thus

$f_\alpha(x) = \sum_y q_0(y)\,f(yx)\,\phi_0(\alpha,y) = E[f(\cdot x)\phi(\alpha,\cdot)]$. ■

Learning Non-Boolean Functions. The restriction that the target function be boolean is not really necessary. For the general case we change the definition of the error to $\mathrm{Err}_h(f) = E[(f-h)^2]$ (the expectation, as usual, being over $q$). We also require the target functions to have polynomially bounded output length under some encoding of the output into bits. Finally the algorithm of Figure 3 is modified to output the function $g(\cdot) = \sum_{z \in S} \hat{f}(z)\phi(z,\cdot)$ rather than just its sign.

7.3 The q-norm of Decision Trees

The results of §5 can also be generalized by appropriately extending the definition of the degree. Namely, let $q$ be mutually independent and $g : \{0,1\}^n \to R$. Then the degree of $g$ with respect to $q$, denoted $\deg_q(g)$, is defined to be the least positive real number $d$ with the property that $L_q(fg) \le (2d-1)L_q(g)$ for all $f : \{0,1\}^n \to \{-1,+1\}$. It should be noted that when $q$ is the uniform distribution this definition of the degree coincides with the one in §5. However I suspect that in general it is not true that $\deg_q(g) = \frac{1}{2}[L_q(g)+1]$.

The weight $w_q(T)$ of a decision tree $T$ is then defined by induction as before, using the new definition of the degree, and the proof of the main theorem can be extended to show that $L_q(T) \le w_q(T)$ for any decision tree $T$. I omit the details.

7.4 $L_q$ versus $L$

Results such as the above motivate us to understand the relationship between the usual spectral norm and the q-spectral norm. In particular, can the change of basis and distribution drastically reduce the spectral norm?

Proposition 7.6 There is a function of exponential spectral norm which has polynomial spectral q-norm with respect to some polynomially bounded, mutually independent $q$.

Proof: Let $q$ be defined by $\Pr[x_i = 1] = 3/4$ for each $i = 1, \ldots, n$. Let $\phi$ be the associated basis. Then $L(\phi(z,\cdot)) \ge 3^{|z|/2}$ (this follows from Proposition 6.1 since $|\phi(z,0^n)| = 3^{|z|/2}$).
So for $z$ having, say, a constant fraction of ones, the (usual) spectral norm of $\phi(z,\cdot)$ is exponential. The q-norm of $\phi(z,\cdot)$, on the other hand, is of course just 1 since $\phi(z,\cdot)$ is a basis function. ■

So $\phi(z,\cdot)$ is not learnable by the original algorithm but can be learned by the generalized one.

Problem. Find an example of a boolean function $f$ and a polynomially bounded, mutually independent $q$ such that $L(f)$ is not polynomially bounded but $L_q(f)$ is. I don't know how hard this is; I haven't had time to try.

Acknowledgements

Thanks to Ron Rivest for suggesting (and writing) a lisp routine for evaluating $\sum_{i=0}^{2k} \binom{2k}{i}^2 (-1)^i$ — I wouldn't have got the answer without it!

References

[Ba] Bahadur, R., "A Representation of the Joint Distribution of Responses to n Dichotomous Items," Studies in Item Analysis and Prediction (edited by H. Solomon), Stanford University Press (1961).

[BOH] Brandman, Y., A. Orlitsky and J. Hennessy, "A Spectral Lower Bound Technique for the Size of Decision Trees and Two-Level AND/OR Circuits," IEEE Transactions on Computers 39(2), 282-287 (February 1990).

[Br] Bruck, J., "Harmonic Analysis of Polynomial Threshold Functions," SIAM J. Discrete Math. 3(2), 168-177 (1990).

[BS] Bruck, J. and R. Smolensky, "Polynomial Threshold Functions, AC Functions, and Spectral Norms," Proceedings of the 31st Annual IEEE Symposium on the Foundations of Computer Science, IEEE (1990).

[FJS] Furst, M., J. Jackson and S. Smith, "Learning AC Functions Sampled under Mutually Independent Distributions," Manuscript (October 1990).

[HMPST] Hajnal, A., W. Maass, P. Pudlak, M. Szegedy and G. Turan, "Threshold Circuits of Bounded Depth," Proceedings of the 28th Annual IEEE Symposium on the Foundations of Computer Science, IEEE (1987).

[Ka] Karpovsky, M., "Finite Orthogonal Series in the Design of Digital Devices," Wiley (1976).

[KM] Kushilevitz, E. and Y. Mansour, "Learning Decision Trees using the Fourier Spectrum," Manuscript (November 1990).
[LMN] Linial, N., Y. Mansour and N. Nisan, "Constant Depth Circuits, Fourier Transform, and Learnability," Manuscript (January 1991). Preliminary version in Proceedings of the 30th Annual IEEE Symposium on the Foundations of Computer Science, IEEE (1989).

[MS] MacWilliams, F. and N. Sloane, "The Theory of Error-Correcting Codes," North-Holland (1978).

[SB] Siu, K. and J. Bruck, "On the Power of Threshold Circuits with Small Weights," Manuscript.

[Va] Valiant, L., "A Theory of the Learnable," Communications of the ACM 27(11), 1134-1142 (1984).

A Appendix: Proof of the Combinatorial Identity

Here I give a proof of the combinatorial identity used to derive the lower bound on the spectral norm of the majority function. Let's begin by recalling the identity:

$\sum_{i=0}^{2k} \binom{2k}{i}^2 (-1)^i = (-1)^k \binom{2k}{k}$.

The proof uses generating functions. We define $a_i = \binom{2k}{i}$ and $b_i = \binom{2k}{i}(-1)^i$ and let

$G(x) = \left( \sum_{i=0}^{2k} a_i x^i \right) \left( \sum_{i=0}^{2k} b_i x^i \right)$.

Let $\sum_{i \ge 0} c_i x^i$ be the formal power series for $G$. By definition of $G$ we have

$c_{2k} = \sum_{i=0}^{2k} a_i b_{2k-i} = \sum_{i=0}^{2k} \binom{2k}{i}\binom{2k}{2k-i}(-1)^{2k-i} = \sum_{i=0}^{2k} \binom{2k}{i}^2 (-1)^i$.

So it suffices to show that $c_{2k} = (-1)^k \binom{2k}{k}$. To do this we compute $G(x)$ explicitly. By the binomial theorem $\sum_{i=0}^{2k} \binom{2k}{i} x^i = (1+x)^{2k}$ and $\sum_{i=0}^{2k} \binom{2k}{i}(-1)^i x^i = (1-x)^{2k}$, so

$G(x) = (1+x)^{2k}(1-x)^{2k} = (1-x^2)^{2k}$.

Applying the binomial theorem again we get

$G(x) = \sum_{i=0}^{2k} \binom{2k}{i} (-1)^i x^{2i}$,

and thus $c_{2k} = (-1)^k \binom{2k}{k}$ as desired.
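As a final sanity check, and in the spirit of the lisp routine mentioned in the acknowledgements, both the identity and the coefficient formula of Lemma 6.5 can be verified by brute force for small $k$. A Python sketch (the function names are mine; the tie-breaking convention for majority, with ties counting as $+1$, is the one consistent with the proof of Lemma 6.5):

```python
from itertools import product
from math import comb

def identity_lhs(k):
    """Left side of the identity: sum_{i=0}^{2k} C(2k,i)^2 (-1)^i."""
    return sum(comb(2 * k, i) ** 2 * (-1) ** i for i in range(2 * k + 1))

def majority(x):
    """M_n(x) = +1 if at least half the bits of x are 1, else -1."""
    return 1 if 2 * sum(x) >= len(x) else -1

def middle_coefficient(k):
    """<M_{4k}, chi_z> for z = 1^{2k} 0^{2k}, by brute force over {0,1}^{4k}."""
    n = 4 * k
    total = sum(majority(x) * (-1) ** sum(x[: 2 * k])
                for x in product([0, 1], repeat=n))
    return total / 2 ** n
```

For small $k$, `identity_lhs(k)` agrees with $(-1)^k \binom{2k}{k}$ and `middle_coefficient(k)` agrees with $2^{-4k}\binom{2k}{k}(-1)^k$, as Lemma 6.5 predicts.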